We analyze the generalization performance of a student in a model composed of nonlinear
perceptrons: a true teacher, ensemble teachers, and the student. We calculate the
generalization error of the student analytically or numerically using statistical mechanics in the
framework of on-line learning. We treat two well-known learning rules: Hebbian learning
and perceptron learning. As a result, it is proven that the nonlinear model shows
qualitatively different behaviors from the linear model. Moreover, it is clarified that Hebbian
learning and perceptron learning show qualitatively different behaviors from each other. In
Hebbian learning, we can analytically obtain the solutions. In this case, the generalization
error monotonically decreases. The steady value of the generalization error is independent
of the learning rate. The larger the number of teachers is and the more variety the ensemble
teachers have, the smaller the generalization error is. In perceptron learning, we have to
numerically obtain the solutions. In this case, the dynamical behaviors of the generalization
error are non-monotonic. The smaller the learning rate is, the larger the number of teachers
is, and the more variety the ensemble teachers have, the smaller the minimum value of the
generalization error is.
1. Introduction

Learning is to infer the underlying rules that dominate data generation using observed
data. Observed data are input-output pairs from a teacher and are called examples. Learning
can be roughly classified into batch learning and on-line learning. 1) In batch learning, given
examples are used more than once. In this paradigm, a student comes to give correct
answers after training if the student has adequate degrees of freedom. However, a long time
and a large memory in which to store many examples are necessary. By contrast, in
on-line learning, examples once used are discarded. In this case, a student cannot give correct
answers for all examples used in training. However, there are merits. For example, a large
memory for storing many examples is not necessary, and it is possible to follow a time-variant
teacher.

Recently, we 2,3) analyzed the generalization performance of ensemble learning 4-6) in a
framework of on-line learning using a statistical mechanical method. 1,10) Using the same
method, we also analyzed the generalization performance of a student supervised by a moving
teacher that goes around a true teacher. 7,8) As a result, it was proven that the generalization
error of a student can be smaller than that of a moving teacher, even if the student only
uses examples from the moving teacher. In an actual human society, a teacher observed by
a student does not always present the correct answer. In many cases, the teacher is learning
and continues to change. Therefore, the analysis of such a model is interesting for considering
the analogies between statistical learning theories and an actual human society.

On the other hand, in most cases in an actual human society, a student can observe 
examples from two or more teachers who differ from each other. Therefore, we analyze the 
generalization performance of such a model and discuss the use of imperfect teachers in this 
paper. That is, we consider a true teacher and K teachers called ensemble teachers who exist 
around the true teacher. A student uses input-output pairs from ensemble teachers in turn or 
randomly. 

A model in which the true teacher, the ensemble teachers and the student are all linear
perceptrons with noise has already been solved analytically. 9) In that case, it was proven that
when the student's learning rate satisfies $\eta < 1$, the larger the number $K$ of ensemble teachers
is and the more variety the ensemble teachers have, the smaller the student's generalization
error is. On the other hand, when $\eta > 1$, the properties are completely reversed. If the variety
of ensemble teachers is rich enough, the direction cosine between the true teacher and the
student becomes unity in the limit of $\eta \to 0$ and $K \to \infty$.

However, linear perceptrons are somewhat special as neural networks or learning machines.
Nonlinear perceptrons are more common than linear ones. Therefore, we analyze the
generalization performance of a student in a model composed of nonlinear perceptrons: a true teacher,
ensemble teachers, and the student. We obtain order parameters and the generalization errors
analytically or numerically in the framework of on-line learning using a statistical mechanical
method. We treat two well-known learning rules: Hebbian learning and perceptron learning.
As a result, it is proven that the nonlinear model shows qualitatively different behaviors from
the linear model. Moreover, it is clarified that Hebbian learning and perceptron learning show
qualitatively different behaviors from each other. In Hebbian learning, we can analytically
obtain the solutions. In this case, the generalization error monotonically decreases. The steady
value of the generalization error is independent of the learning rate $\eta$. The larger the number
$K$ of teachers is and the more variety the ensemble teachers have, the smaller the
generalization error is. In perceptron learning, we have to numerically obtain the solutions. In this
case, the dynamical behaviors of the generalization error are non-monotonic. The smaller the
learning rate $\eta$ is, the larger the number $K$ of teachers is, and the more variety the ensemble
teachers have, the smaller the minimum value of the generalization error is.

2. Model 

In this paper, we consider a true teacher, $K$ ensemble teachers and a student. They are
all nonlinear perceptrons with connection weights $\boldsymbol{A}$, $\boldsymbol{B}_k$ and $\boldsymbol{J}$, respectively. Here,
$k = 1, \ldots, K$. For simplicity, the connection weights of the true teacher, the ensemble teachers
and the student are simply called the true teacher, the ensemble teachers and the student,
respectively. True teacher $\boldsymbol{A} = (A_1, \ldots, A_N)$, ensemble teachers $\boldsymbol{B}_k = (B_{k1}, \ldots, B_{kN})$, student
$\boldsymbol{J} = (J_1, \ldots, J_N)$ and input $\boldsymbol{x} = (x_1, \ldots, x_N)$ are $N$-dimensional vectors. Each component
$A_i$ of $\boldsymbol{A}$ is drawn from $\mathcal{N}(0, 1)$ independently and fixed, where $\mathcal{N}(0, 1)$ denotes the Gaussian
distribution with a mean of zero and a variance of unity. Some components $B_{ki}$ are equal
to $A_i$ multiplied by $-1$, and the others are equal to $A_i$. Which component $B_{ki}$ is equal to
$-A_i$ is independent of the value of $A_i$. Hence, $B_{ki}$ also obeys $\mathcal{N}(0, 1)$. $B_{ki}$ is also fixed. The
direction cosine between $\boldsymbol{B}_k$ and $\boldsymbol{A}$ is $R_{Bk}$ and that between $\boldsymbol{B}_k$ and $\boldsymbol{B}_{k'}$ is $q_{kk'}$. Each of the
components $J_i^0$ of the initial value $\boldsymbol{J}^0$ of $\boldsymbol{J}$ is drawn from $\mathcal{N}(0, 1)$ independently. The direction
cosine between $\boldsymbol{J}$ and $\boldsymbol{A}$ is $R_J$ and that between $\boldsymbol{J}$ and $\boldsymbol{B}_k$ is $R_{BkJ}$. Each component $x_i$ of $\boldsymbol{x}$
is drawn from $\mathcal{N}(0, 1/N)$ independently. Thus,

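$$\langle A_i \rangle = 0, \qquad \langle (A_i)^2 \rangle = 1, \tag{1}$$

$$\langle B_{ki} \rangle = 0, \qquad \langle (B_{ki})^2 \rangle = 1, \tag{2}$$

$$\langle J_i^0 \rangle = 0, \qquad \langle (J_i^0)^2 \rangle = 1, \tag{3}$$

$$\langle x_i \rangle = 0, \qquad \langle (x_i)^2 \rangle = \frac{1}{N}, \tag{4}$$

$$R_{Bk} = \frac{\boldsymbol{A} \cdot \boldsymbol{B}_k}{\|\boldsymbol{A}\| \|\boldsymbol{B}_k\|}, \qquad q_{kk'} = \frac{\boldsymbol{B}_k \cdot \boldsymbol{B}_{k'}}{\|\boldsymbol{B}_k\| \|\boldsymbol{B}_{k'}\|}, \tag{5}$$

$$R_J = \frac{\boldsymbol{A} \cdot \boldsymbol{J}}{\|\boldsymbol{A}\| \|\boldsymbol{J}\|}, \qquad R_{BkJ} = \frac{\boldsymbol{B}_k \cdot \boldsymbol{J}}{\|\boldsymbol{B}_k\| \|\boldsymbol{J}\|}, \tag{6}$$

where $\langle \cdot \rangle$ denotes a mean. Figure 1 illustrates the relationship among true teacher $\boldsymbol{A}$, ensemble
teachers $\boldsymbol{B}_k$, student $\boldsymbol{J}$ and direction cosines $q_{kk'}$, $R_{Bk}$, $R_J$ and $R_{BkJ}$.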
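
The sign-flip construction of the ensemble teachers described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `make_teachers` and `direction_cosine` are hypothetical, and it assumes that each teacher's flipped components are chosen uniformly at random and independently of the other teachers, which the text does not state explicitly. Under that assumption, flipping a fraction $p$ of the components gives $R_{Bk} \approx 1 - 2p$ and $q_{kk'} \approx (1 - 2p)^2$.

```python
import math
import random


def make_teachers(n, k, r_b, rng):
    """Draw a true teacher A and k ensemble teachers B_k by sign flips.

    Each B_k copies A and negates a randomly chosen subset of components.
    Flipping a fraction p of the components gives a direction cosine of
    about 1 - 2p with A, so p = (1 - r_b) / 2 targets R_B = r_b.
    (Independent, uniformly random flip sets are an assumption of this sketch.)
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    n_flip = round(n * (1.0 - r_b) / 2.0)
    teachers = []
    for _ in range(k):
        flipped = set(rng.sample(range(n), n_flip))
        teachers.append([-a_i if i in flipped else a_i for i, a_i in enumerate(a)])
    return a, teachers


def direction_cosine(u, v):
    """Direction cosine as in eqs. (5) and (6)."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))


rng = random.Random(0)
A, B = make_teachers(n=20000, k=3, r_b=0.7, rng=rng)
R_B1 = direction_cosine(A, B[0])     # should be close to 0.7
q_12 = direction_cosine(B[0], B[1])  # should be close to 0.7 ** 2
print(R_B1, q_12)
```

With a finite dimension the measured values fluctuate around their targets by an amount of order $1/\sqrt{N}$.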

In this paper, the thermodynamic limit $N \to \infty$ is also treated. Therefore,

$$\|\boldsymbol{A}\| = \sqrt{N}, \qquad \|\boldsymbol{B}_k\| = \sqrt{N}, \qquad \|\boldsymbol{J}^0\| = \sqrt{N}, \qquad \|\boldsymbol{x}\| = 1. \tag{7}$$

Generally, the norm $\|\boldsymbol{J}\|$ of the student changes as the time step proceeds. Therefore, the ratio
$l^m$ of the norm to $\sqrt{N}$ is introduced and called the length of the student. That is,
$\|\boldsymbol{J}^m\| = l^m \sqrt{N}$, where $m$ denotes the time step.
The internal potentials $y^m$ of the true teacher, $v_k^m$ of the ensemble teachers, and $u^m l^m$ of
the student are

$$y^m = \boldsymbol{A} \cdot \boldsymbol{x}^m, \tag{8}$$

$$v_k^m = \boldsymbol{B}_k \cdot \boldsymbol{x}^m, \tag{9}$$

$$u^m l^m = \boldsymbol{J}^m \cdot \boldsymbol{x}^m, \tag{10}$$

respectively. Here, $y^m$, $v_k^m$ and $u^m$ obey the Gaussian distributions with means of zero and
the covariance matrix $\Sigma$:

$$\Sigma = \begin{pmatrix} 1 & R_{Bk} & R_J \\ R_{Bk} & 1 & R_{BkJ} \\ R_J & R_{BkJ} & 1 \end{pmatrix}. \tag{11}$$

The outputs of the true teacher, the ensemble teachers, and the student are
$\mathrm{sgn}(y^m)$, $\mathrm{sgn}(v_k^m)$ and $\mathrm{sgn}(u^m l^m)$, respectively. Here, $\mathrm{sgn}(\cdot)$ is a sign function defined as

$$\mathrm{sgn}(z) = \begin{cases} +1, & z \ge 0, \\ -1, & z < 0. \end{cases} \tag{12}$$

In the model treated in this paper, the student $\boldsymbol{J}$ is updated using an input $\boldsymbol{x}$ and the
output of an ensemble teacher $\boldsymbol{B}_k$ for the input. That is,

$$\boldsymbol{J}^{m+1} = \boldsymbol{J}^m + f_k^m \boldsymbol{x}^m, \tag{13}$$

where $f_k^m$ denotes a function that represents the update amount and is determined by the
learning rule. In the well-known learning rules for nonlinear perceptrons, Hebbian learning
and perceptron learning, $f_k^m$ are

$$f_k^m = \eta\, \mathrm{sgn}(v_k^m), \tag{14}$$

$$f_k^m = \eta\, \Theta(-u^m v_k^m)\, \mathrm{sgn}(v_k^m), \tag{15}$$

respectively. Here, $\eta$ is the learning rate of the student and is constant. $\Theta(\cdot)$ is a step function
defined as

$$\Theta(z) = \begin{cases} +1, & z \ge 0, \\ 0, & z < 0. \end{cases} \tag{16}$$
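
The two update amounts of eqs. (14) and (15) and the on-line step of eq. (13) can be written out as a minimal Python sketch. The function names (`f_hebbian`, `f_perceptron`, `update`) are hypothetical and chosen only for illustration. Because $l > 0$, the sign of the student's potential $u l$ equals the sign of $u$, so only $u$ is passed.

```python
def sgn(z):
    """Sign function of eq. (12): +1 for z >= 0, -1 otherwise."""
    return 1.0 if z >= 0 else -1.0


def theta(z):
    """Step function of eq. (16): 1 for z >= 0, 0 otherwise."""
    return 1.0 if z >= 0 else 0.0


def f_hebbian(eta, u, v):
    """Update amount of eq. (14); the student potential u is unused."""
    return eta * sgn(v)


def f_perceptron(eta, u, v):
    """Update amount of eq. (15): nonzero only when u and v disagree in sign."""
    return eta * theta(-u * v) * sgn(v)


def update(j, x, f):
    """One on-line step of eq. (13): J <- J + f * x."""
    return [j_i + f * x_i for j_i, x_i in zip(j, x)]


# A student whose output already agrees with the teacher is left unchanged
# by perceptron learning but is still pushed by Hebbian learning.
j = [1.0, 0.0]
x = [0.6, 0.8]
print(update(j, x, f_perceptron(0.5, u=0.6, v=0.3)))
print(update(j, x, f_hebbian(0.5, u=0.6, v=0.3)))
```

The example at the end illustrates the structural difference between the two rules: Hebbian learning updates on every example, whereas perceptron learning updates only on examples the student currently gets wrong.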

3. Theory 

3.1 Generalization error 

A goal of statistical learning theory is to theoretically obtain generalization errors. We use

$$\epsilon^m = \Theta(-y^m u^m) \tag{17}$$

as the error of the student. The superscripts $m$, which represent the time step, are omitted
for simplicity unless stated otherwise. Since the generalization error is the mean of errors for
the true teacher over the distribution of new input, generalization error $\epsilon_g$ of student $\boldsymbol{J}$ is
calculated as follows:

$$\epsilon_g = \int d\boldsymbol{x}\, P(\boldsymbol{x})\, \epsilon \tag{18}$$

$$\phantom{\epsilon_g} = \int dy\, du\, P(y, u)\, \epsilon(y, u) \tag{19}$$

$$\phantom{\epsilon_g} = \frac{1}{\pi} \tan^{-1} \frac{\sqrt{1 - R_J^2}}{R_J}. \tag{20}$$

Here, integration has been executed using the following: $y$ and $u$ obey $\mathcal{N}(0, 1)$. The covariance
between $y$ and $u$ is $R_J$.
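
As a quick numerical check of eq. (20), the closed form can be compared with a Monte Carlo estimate of the disagreement probability between $\mathrm{sgn}(y)$ and $\mathrm{sgn}(u)$. The sketch below is illustrative only and its function names are hypothetical. It evaluates the inverse tangent with `math.atan2`, which coincides with eq. (20) for $R_J > 0$ and extends it to $R_J \le 0$; that choice of branch is an assumption of this sketch.

```python
import math
import random


def generalization_error(r_j):
    """Eq. (20), evaluated with atan2 so that R_J <= 0 is also handled."""
    return math.atan2(math.sqrt(1.0 - r_j * r_j), r_j) / math.pi


def monte_carlo_error(r_j, n_samples, rng):
    """Fraction of sign disagreements for Gaussian (y, u) with covariance R_J."""
    errors = 0
    for _ in range(n_samples):
        y = rng.gauss(0.0, 1.0)
        # u has unit variance and covariance r_j with y.
        u = r_j * y + math.sqrt(1.0 - r_j * r_j) * rng.gauss(0.0, 1.0)
        if y * u <= 0:  # error of eq. (17): Theta(-y u) = 1
            errors += 1
    return errors / n_samples


rng = random.Random(1)
print(generalization_error(0.7), monte_carlo_error(0.7, 100000, rng))
```

The two printed values should agree up to sampling noise of order $1/\sqrt{n}$ for $n$ samples.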

3.2 Differential equations for order parameters

To simplify the analysis, the following auxiliary order parameters are introduced:

$$r_J \equiv R_J l, \tag{21}$$

$$r_{BkJ} \equiv R_{BkJ} l. \tag{22}$$

Simultaneous differential equations in deterministic forms, 10) which describe the dynamical
behaviors of order parameters, have been obtained based on self-averaging in the
thermodynamic limits as follows:

$$\frac{dr_{BkJ}}{dt} = \frac{1}{K} \sum_{k'=1}^{K} \langle f_{k'} v_k \rangle, \tag{23}$$

$$\frac{dr_J}{dt} = \frac{1}{K} \sum_{k=1}^{K} \langle f_k y \rangle, \tag{24}$$

$$\frac{dl}{dt} = \frac{1}{K} \sum_{k=1}^{K} \left( \langle f_k u \rangle + \frac{\langle f_k^2 \rangle}{2l} \right). \tag{25}$$

Here, dimension $N$ has been treated to be sufficiently greater than the number $K$ of ensemble
teachers. Time is defined by $t = m/N$, that is, time step $m$ normalized by dimension $N$. Note
that the above differential equations are identical whether the $K$ ensemble teachers are used
in turn or randomly.
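
As a brief sketch of where eq. (25) comes from (an outline under the self-averaging assumption, not the full derivation of ref. 10), one can square both sides of eq. (13), use $\|\boldsymbol{x}^m\| = 1$ from eq. (7) and $\boldsymbol{J}^m \cdot \boldsymbol{x}^m = u^m l^m$ from eq. (10), and then replace the increments by their averages over inputs and over the $K$ equally used teachers:

```latex
\begin{align*}
\|\boldsymbol{J}^{m+1}\|^2
  &= \|\boldsymbol{J}^m\|^2 + 2 f_k^m\, \boldsymbol{J}^m \cdot \boldsymbol{x}^m
     + (f_k^m)^2 \|\boldsymbol{x}^m\|^2, \\
N (l^{m+1})^2
  &= N (l^m)^2 + 2 f_k^m u^m l^m + (f_k^m)^2, \\
\frac{d(l^2)}{dt}
  &= \frac{1}{K} \sum_{k=1}^{K}
     \left( 2 l \langle f_k u \rangle + \langle f_k^2 \rangle \right).
\end{align*}
```

Since $d(l^2)/dt = 2l\, dl/dt$, dividing the last line by $2l$ gives eq. (25). Equations (23) and (24) follow in the same way by taking the inner product of eq. (13) with $\boldsymbol{B}_k$ and $\boldsymbol{A}$, respectively.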



3.3 Hebbian learning 

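Since $y$, $v_k$ and $u$ obey the triple Gaussian distribution with means of zero and the
covariance matrix of eq. (11), the four sample averages that appear in eqs. (23)-(25) in Hebbian
learning can be calculated using eq. (14) as follows:

$$\langle f_{k'} v_k \rangle = \frac{2 \eta\, q_{kk'}}{\sqrt{2\pi}}, \tag{26}$$

$$\langle f_k y \rangle = \frac{2 \eta\, R_{Bk}}{\sqrt{2\pi}}, \tag{27}$$

$$\langle f_k u \rangle = \frac{2 \eta\, R_{BkJ}}{\sqrt{2\pi}}, \tag{28}$$

$$\langle f_k^2 \rangle = \eta^2. \tag{29}$$

Since all components $A_i$, $J_i^0$ of true teacher $\boldsymbol{A}$ and the initial student $\boldsymbol{J}^0$ are drawn from
$\mathcal{N}(0, 1)$ independently and because the thermodynamic limit $N \to \infty$ is also treated, they are
orthogonal to each other in the initial state. That is,

$$R_J = 0 \quad \text{when } t = 0. \tag{30}$$

In addition,

$$l = 1 \quad \text{when } t = 0. \tag{31}$$

Using eqs. (26)-(31), the simultaneous differential equations (23)-(25) can be solved
analytically as follows:

$$r_{BkJ} = \frac{2 \eta}{\sqrt{2\pi}\, K} \sum_{k'=1}^{K} q_{kk'}\, t, \tag{32}$$

$$r_J = \frac{2 \eta}{\sqrt{2\pi}\, K} \sum_{k=1}^{K} R_{Bk}\, t, \tag{33}$$

$$l^2 = \frac{2 \eta^2}{\pi K^2} \sum_{k=1}^{K} \left( \sum_{k'=1}^{K} q_{kk'} \right) t^2 + \eta^2 t + 1. \tag{34}$$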

3.4 Perceptron learning 

Since $y$, $v_k$ and $u$ obey the triple Gaussian distribution with means of zero and the
covariance matrix of eq. (11), the four sample averages that appear in eqs. (23)-(25) in perceptron
learning can be calculated using eq. (15) as follows:

$$\langle f_{k'} v_k \rangle = \eta\, \frac{q_{kk'} - R_{BkJ}}{\sqrt{2\pi}}, \tag{35}$$

$$\langle f_k y \rangle = \eta\, \frac{R_{Bk} - R_J}{\sqrt{2\pi}}, \tag{36}$$

$$\langle f_k u \rangle = \eta\, \frac{R_{BkJ} - 1}{\sqrt{2\pi}}, \tag{37}$$

$$\langle f_k^2 \rangle = \frac{\eta^2}{\pi} \tan^{-1} \frac{\sqrt{1 - R_{BkJ}^2}}{R_{BkJ}}. \tag{38}$$

Since the simultaneous differential equations cannot be solved analytically in this case, we
solve these equations numerically.
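
One way such a numerical solution might be obtained is a simple forward-Euler integration, sketched below. This is not necessarily the scheme used for the figures. The sketch assumes uniform teachers ($R_{Bk} = R_B$, $q_{kk'} = q$ for $k \neq k'$), so that all $r_{BkJ}$ share a single value, and it starts from $r_{BkJ} = r_J = 0$ and $l = 1$. The function name, the step size `dt` and the use of `math.acos` (equal to the inverse tangent in eqs. (20) and (38) for a positive argument) are choices of this sketch.

```python
import math


def integrate_perceptron(eta, k, q, r_b, t_end, dt=0.01):
    """Euler integration of eqs. (23)-(25) with the perceptron averages.

    Uniform case only. Returns a list of (t, eps_g) pairs.
    """
    c = eta / math.sqrt(2.0 * math.pi)
    q_bar = (1.0 + (k - 1) * q) / k   # (1/K) * sum over k' of q_kk'
    r_bj, r_j, l = 0.0, 0.0, 1.0      # initial order parameters
    t, history = 0.0, []
    for _ in range(int(round(t_end / dt))):
        big_r_bj = max(-1.0, min(1.0, r_bj / l))  # R_BkJ = r_BkJ / l
        big_r_j = max(-1.0, min(1.0, r_j / l))    # R_J = r_J / l
        history.append((t, math.acos(big_r_j) / math.pi))
        d_r_bj = c * (q_bar - big_r_bj)           # eq. (23) with eq. (35)
        d_r_j = c * (r_b - big_r_j)               # eq. (24) with eq. (36)
        d_l = (c * (big_r_bj - 1.0)               # eq. (25) with eqs. (37), (38)
               + eta ** 2 * math.acos(big_r_bj) / (2.0 * math.pi * l))
        r_bj += dt * d_r_bj
        r_j += dt * d_r_j
        l += dt * d_l
        t += dt
    return history


hist = integrate_perceptron(eta=0.2, k=10, q=0.49, r_b=0.7, t_end=200.0)
eps_min = min(e for _, e in hist)
eps_final = hist[-1][1]
print(eps_min, eps_final)
```

Comparing `eps_min` with `eps_final` gives a direct way to see whether the trajectory of the generalization error is non-monotonic for a given learning rate.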

4. Results and Discussion 

In this section, we treat the case where the direction cosines $R_{Bk}$ between the ensemble
teachers and the true teacher, and the direction cosines $q_{kk'}$ among the ensemble teachers are
uniform. That is,

$$R_{Bk} = R_B, \qquad k = 1, \ldots, K, \tag{39}$$

$$q_{kk'} = \begin{cases} q, & k \neq k', \\ 1, & k = k'. \end{cases} \tag{40}$$

In Hebbian learning, since order parameters are analytically obtained, we can understand
the dynamical behaviors clearly and deeply. Considering eqs. (21), (33), (34), (39) and (40),
$R_J$ is obtained as follows:

$$R_J = \frac{R_B}{\sqrt{q + \dfrac{1 - q}{K} + \dfrac{\pi}{2} \left( \dfrac{1}{\eta^2 t^2} + \dfrac{1}{t} \right)}}. \tag{41}$$

Equation (41) shows the following: the dynamical behaviors of $R_J$ are monotonically
increasing. The larger the learning rate $\eta$ is, the larger the direction cosine $R_J$ is. $R_J$ in the
limit of $t \to \infty$ is obtained as follows:

$$R_J \to \frac{R_B}{\sqrt{q + \dfrac{1 - q}{K}}}. \tag{42}$$

This equation shows that the steady state value of $R_J$ is independent of the learning rate
$\eta$. The larger the number $K$ of ensemble teachers is and the smaller the direction cosine $q$
among ensemble teachers is, the larger the steady state value of $R_J$ is.
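
The closed-form Hebbian result is easy to evaluate directly. The sketch below assumes the form $R_J = R_B / \sqrt{q + (1 - q)/K + (\pi/2)(1/(\eta^2 t^2) + 1/t)}$ for eq. (41), which is what eqs. (21), (33) and (34) give for uniform teachers, together with its $t \to \infty$ limit for eq. (42). The function names are hypothetical.

```python
import math


def hebbian_r_j(t, eta, k, q, r_b):
    """Direction cosine R_J(t) of eq. (41) for t > 0 (uniform R_B and q)."""
    transient = (math.pi / 2.0) * (1.0 / (eta * eta * t * t) + 1.0 / t)
    return r_b / math.sqrt(q + (1.0 - q) / k + transient)


def hebbian_r_j_steady(k, q, r_b):
    """Steady-state value of eq. (42), the t -> infinity limit of eq. (41)."""
    return r_b / math.sqrt(q + (1.0 - q) / k)


for eta in (0.3, 1.0, 3.0):
    curve = [hebbian_r_j(t, eta, k=10, q=0.49, r_b=0.7) for t in (1, 10, 100, 1000)]
    print(eta, [round(r, 3) for r in curve])
print(hebbian_r_j_steady(k=10, q=0.49, r_b=0.7))
```

Printing the curves for several learning rates makes the three statements above concrete: each curve increases with $t$, a larger $\eta$ gives a larger $R_J$ at the same $t$, and all curves approach the same steady-state value.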

Considering that the generalization error $\epsilon_g$ calculated by eq. (20) monotonically decreases
as $R_J$ increases, $\epsilon_g$ in the case of Hebbian learning monotonically decreases. The larger $\eta$ is,
the smaller $\epsilon_g$ is in the transient phase. The steady state value of $\epsilon_g$ is independent of $\eta$.
However, the larger the number $K$ is and the smaller $q$ is, the smaller the steady state value
of $\epsilon_g$ is. Therefore, the larger the number of teachers is and the more variety the ensemble
teachers have, the more clever the student can become.

Fig. 2. Dynamical behaviors of generalization error $\epsilon_g$. Hebbian learning. Theory and computer
simulations. Conditions other than $\eta$ are $K = 10$, $q = 0.49$ and $R_B = 0.7$.

Fig. 3. Steady state value of generalization error $\epsilon_g$. Hebbian learning. Theory and computer
simulations. $R_B = 0.7$. When $q = R_B^2$ and $K = \infty$, the steady state value of $\epsilon_g$ is zero.



Equation (42) shows $R_J \to R_B / \sqrt{q}$ in the limit of $K \to \infty$. On the other hand, when $\boldsymbol{S}$
and $\boldsymbol{T}$ are generated independently under conditions where the direction cosines between $\boldsymbol{S}$
and $\boldsymbol{P}$ and between $\boldsymbol{T}$ and $\boldsymbol{P}$ are both $R_0$, where $\boldsymbol{S}$, $\boldsymbol{T}$ and $\boldsymbol{P}$ are high dimensional vectors,
the direction cosine between $\boldsymbol{S}$ and $\boldsymbol{T}$ is $q_0 = R_0^2$, as shown in the appendix. Therefore, if
ensemble teachers have enough variety that they have been generated independently under
the condition that all direction cosines between ensemble teachers and the true teacher are
$R_B$, then $R_B / \sqrt{q} = 1$ and the direction cosine $R_J$ between the student and the true teacher
approaches unity in the limit of $K \to \infty$. Then, the generalization error approaches zero.

The dynamical behaviors of generalization error $\epsilon_g$ have been analytically obtained by
eqs. (20) and (41) in Hebbian learning. Figures 2 and 3 show the analytical results of $\epsilon_g$ and
the steady state value of $\epsilon_g$ with corresponding simulation results. In computer simulations,
the dimension is $N = 2000$ and the $K$ ensemble teachers are used in turn. The generalization error
$\epsilon_g$ was obtained by test for 10 random inputs at each time step. In these figures, the curves
represent theoretical results. The symbols represent simulation results. In Fig. 2, conditions
other than $\eta$ are common: $K = 10$, $q = 0.49$ and $R_B = 0.7$. In Fig. 3, only $R_B$ is common:
$R_B = 0.7$. These figures confirm the discussion above.
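
A small-scale version of such a simulation can be sketched as follows. It is not the authors' simulation code: the dimension is reduced to keep the run short, the teachers are built with independent random sign flips (so that $q \approx R_B^2 = 0.49$, an assumption of this sketch), and the direction cosine $R_J$ is measured directly from the weight vectors instead of estimating $\epsilon_g$ on test inputs. The theoretical curve is evaluated in the form $R_J = R_B / \sqrt{q + (1 - q)/K + (\pi/2)(1/(\eta^2 t^2) + 1/t)}$. All function names are hypothetical.

```python
import math
import random


def simulate_hebbian(n, k, r_b, eta, t_end, rng):
    """On-line Hebbian learning from K sign-flipped teachers used in turn.

    Returns the measured direction cosine R_J between student and true teacher.
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    n_flip = round(n * (1.0 - r_b) / 2.0)
    teachers = []
    for _ in range(k):
        flipped = set(rng.sample(range(n), n_flip))
        teachers.append([-a_i if i in flipped else a_i for i, a_i in enumerate(a)])
    j = [rng.gauss(0.0, 1.0) for _ in range(n)]       # initial student J^0
    sigma_x = 1.0 / math.sqrt(n)                       # x_i ~ N(0, 1/N)
    for m in range(int(t_end * n)):                    # time t = m / N
        x = [rng.gauss(0.0, sigma_x) for _ in range(n)]
        b = teachers[m % k]                            # teachers used in turn
        v = sum(b_i * x_i for b_i, x_i in zip(b, x))   # eq. (9)
        f = eta if v >= 0 else -eta                    # eq. (14)
        j = [j_i + f * x_i for j_i, x_i in zip(j, x)]  # eq. (13)
    dot = sum(a_i * j_i for a_i, j_i in zip(a, j))
    norm = math.sqrt(sum(a_i * a_i for a_i in a) * sum(j_i * j_i for j_i in j))
    return dot / norm


def theory_r_j(t, eta, k, q, r_b):
    """Closed-form R_J(t) for Hebbian learning with uniform teachers."""
    transient = (math.pi / 2.0) * (1.0 / (eta * eta * t * t) + 1.0 / t)
    return r_b / math.sqrt(q + (1.0 - q) / k + transient)


rng = random.Random(0)
r_sim = simulate_hebbian(n=500, k=10, r_b=0.7, eta=1.0, t_end=4.0, rng=rng)
r_theory = theory_r_j(4.0, eta=1.0, k=10, q=0.49, r_b=0.7)
print(r_sim, r_theory)
```

At this small dimension the simulated value fluctuates around the theoretical one by a few percent. The agreement is expected to tighten as $N$ grows, since the theory is derived in the thermodynamic limit.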

On the other hand, in perceptron learning, we cannot solve eqs. (23)-(25) analytically.
Therefore, we obtain the solutions numerically. The dynamical behaviors of generalization
errors $\epsilon_g$ are shown in Figs. 4-6.

In Fig. 4, conditions other than $\eta$ are $K = 10$, $q = 0.49$ and $R_B = 0.7$. In Fig. 5, conditions
other than $K$ are $\eta = 0.2$, $q = 0.49$ and $R_B = 0.7$. In Fig. 6, conditions other than $q$ are
$K = 10$, $\eta = 0.2$ and $R_B = 0.7$. Figure 4 shows that the dynamical behaviors of $\epsilon_g$ have
non-monotonic properties when the learning rate $\eta$ is relatively small. Moreover, Figs. 5 and 6
show that the steady state value of the generalization error is independent of $K$ and $q$. These
are remarkable differences from the properties of Hebbian learning.

Fig. 4. Dynamical behaviors of generalization error $\epsilon_g$. Perceptron learning. Theory and computer
simulations. Conditions other than $\eta$ are $K = 10$, $q = 0.49$ and $R_B = 0.7$.

When the learning rate $\eta$ is relatively small, the minimum value $\epsilon_g(\mathrm{min})$ of the
generalization error exists and the smaller $\eta$ is, the smaller $\epsilon_g(\mathrm{min})$ is. The relationships between $K$
and $\epsilon_g(\mathrm{min})$, and $q$ and $\epsilon_g(\mathrm{min})$ are shown in Figs. 7 and 8, respectively. In Fig. 7, conditions
other than $\eta$ are $q = 0.49$ and $R_B = 0.7$. In Fig. 8, conditions other than $\eta$ are $K = 10$ and
$R_B = 0.7$. These figures show that the larger the number $K$ is and the smaller the direction
cosine $q$ is, the smaller the minimum value of generalization errors is. In other words, the
larger the number of teachers is and the more variety the ensemble teachers have, the more
clever the student can become.

Fig. 5. Dynamical behaviors of generalization error $\epsilon_g$. Perceptron learning. Theory and computer
simulations. Conditions other than $K$ are $\eta = 0.2$, $q = 0.49$ and $R_B = 0.7$.

Fig. 6. Dynamical behaviors of generalization error $\epsilon_g$. Perceptron learning. Theory and computer
simulations. Conditions other than $q$ are $K = 10$, $\eta = 0.2$ and $R_B = 0.7$.

In the case of the linear model, 9) the properties can be summarized as follows:
The smaller $\eta$ is, the smaller the steady state value of $\epsilon_g$ is. When the learning rate satisfies
$\eta < 1$, the larger $K$ is and the smaller $q$ is, the smaller the steady state value of $\epsilon_g$ is. On the
contrary, when $\eta > 1$, the properties are completely reversed. 9) Comparing the linear model
and the nonlinear model treated in this paper, we find qualitatively different properties.

Fig. 7. Relationship between $K$ and minimum values $\epsilon_g(\mathrm{min})$ of generalization error. Perceptron
learning. Theory. $q = 0.49$, $R_B = 0.7$.




5. Conclusion 

We have analyzed the generalization performance of a student in a model composed of 
nonlinear perceptrons: a true teacher, ensemble teachers, and the student. We have calculated 
the generalization error of the student analytically or numerically using statistical mechanics 
in the framework of on-line learning. We have treated two well-known learning rules: Hebbian 
learning and perceptron learning. As a result, it has been proven that the nonlinear model 
shows qualitatively different behaviors from the linear model. Moreover, it has been clarified 
that Hebbian learning and perceptron learning show qualitatively different behaviors from 
each other. In Hebbian learning, we have analytically obtained the solutions. In this case, the 
generalization error monotonically decreases. The steady value of the generalization error is 
independent of the learning rate. The larger the number of teachers is and the more variety the
ensemble teachers have, the smaller the generalization error is. In perceptron learning, we have 
obtained the solutions numerically. In this case, the dynamical behaviors of the generalization 
error are non-monotonic. The smaller the learning rate is, the larger the number of teachers 
is, and the more variety the ensemble teachers have, the smaller the minimum value of the 
generalization error is. 
Appendix: Direction cosine q among ensemble teachers

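Let us consider the case where $\boldsymbol{S}$ and $\boldsymbol{T}$ are generated independently, satisfying the
condition that direction cosines between $\boldsymbol{S}$ and $\boldsymbol{P}$ and between $\boldsymbol{T}$ and $\boldsymbol{P}$ are both $R_0$, as shown
in Fig. A-1, where $\boldsymbol{S}$, $\boldsymbol{T}$ and $\boldsymbol{P}$ are $N$ dimensional vectors. In this figure, the inner product
of $\boldsymbol{s}$ and $\boldsymbol{t}$ is

$$\boldsymbol{s} \cdot \boldsymbol{t} = \|\boldsymbol{S}\| \|\boldsymbol{T}\| \left( q_0 - R_0^2 \right),$$

where $\boldsymbol{s}$ and $\boldsymbol{t}$ are projections from $\boldsymbol{S}$ to the orthogonal complement $C$ of $\boldsymbol{P}$ and from $\boldsymbol{T}$ to
$C$, respectively, and $q_0$ denotes the direction cosine between $\boldsymbol{S}$ and $\boldsymbol{T}$.

Incidentally, if dimension $N$ is large and $\boldsymbol{S}$ and $\boldsymbol{T}$ have been generated independently, $\boldsymbol{s}$
and $\boldsymbol{t}$ should be orthogonal to each other. Therefore, $q_0 = R_0^2$.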
Fig. A-1. Direction cosine among ensemble teachers.